ABSTRACT


While the idea that querying mechanisms for complex 
relationships (otherwise known as Semantic Associations) 
should be integral to Semantic Web search technologies has 
recently gained some ground, the issue of how search results 
will be ranked remains largely unaddressed. Since it is expected 
that the number of relationships between entities in a knowledge 
base will be much larger than the number of entities themselves, 
the likelihood that Semantic Association searches would result 
in an overwhelming number of results for users is increased, 
therefore elevating the need for appropriate ranking schemes. 
Furthermore, it is unlikely that ranking schemes for ranking 
entities (documents, resources, etc.) may be applied to complex 
structures such as Semantic Associations. 


In this paper, we present an approach that ranks results based on 
how predictable a result might be for users. It is based on a 
relevance model SemRank, which is a rich blend of semantic 
and information-theoretic techniques with heuristics that 
supports the novel idea of modulative searches, where users may 
vary their search modes to effect changes in the ordering of 
results depending on their need. We also present the 
infrastructure used in the SSARK system to support the 
computation of SemRank values for resulting Semantic 
Associations and their ordering. 


Categories and Subject Descriptors 


H.3.3 [Information Systems] Information Search and Retrieval 


General Terms 
Algorithms, Experimentation, Measurement 


Keywords 
Semantic Web, SemRank, Semantic Ranking, Ranking Complex
Relationships, Semantic Associations Search, Semantic
Relationship Search, Semantic Match, Semantic Similarity,
Discovery Query, Path Expression Tree, Semantic Summary


1 INTRODUCTION 


Search technologies today are primarily centered on enabling
search for entities or, on the Semantic Web, for resources.
However, the following quote by Grady Booch [5] summarizes
the limitations of a purely entity-centric world view:

“An object by itself is intensely uninteresting”. 

Similar sentiments were echoed in [23] where relationships were 
emphasized as the heart of the Semantic Web. Correspondingly, 
efforts must be made to extend or identify alternatives to 
traditional search mechanisms focused on finding documents 
described either by keywords or semantic annotations, with 
capabilities for searching about complex relationships between 
Semantic Web resources. Such search capabilities may become 
the foundation for a “Relationship Search Engine”, a technology 
with the potential for immense real world value in analytical and 
knowledge discovery applications [27]. Relationship search
technologies will provide an effective means to answer questions
such as “Does a semantic relationship exist between X and Y?”
One step in this direction is [1][3], where the notion of Semantic
Associations is formalized to refer to complex relationships
between resources that capture the connectivity or similarity of
the resources. In graph-theoretic terms, they refer to labeled
paths (not necessarily directed) that either connect resources or
in which the resources participate similarly. In addition, progress
[4][18][3] is now being made in developing efficient evaluation
strategies for discovering such relationships on the Semantic Web.

Another important related issue that must be addressed by 
relationship search technologies is how to determine the relative 
importance of relationships found with respect to a user’s 
context because that impacts how they will be ranked. This 
problem is particularly important because it is possible, in fact
likely, that the number of such relationships is much larger
than the number of entities themselves. This means that there is
potential for creating a more acute information overload 
problem than currently exists on the Web. It is therefore 
imperative that we develop techniques for ordering search 
results in order to present results of highest importance first. 
Unfortunately, it is not clear that the techniques currently used
for ranking entities on the Web (e.g., PageRank [7] and HITS [14]
for HTML, [13][8] for XML, and [23] on the Semantic Web)
can be used to rank complex relationships. There is some related
work [26] being done in the area of ranking Semantic Networks.
However, this approach suffers from the same limitation as most
ranking approaches, which have a fixed ranking scheme that
imposes a single type of ordering on results. That is, the
same query made in different contexts and for different
purposes still yields the same ordering. It seems that some
flexibility should be built into relevance models so that
different orderings may be imposed on the same result set
depending on the user’s need. For example, in an investigative 
context the focus of a search may be to uncover obscure 
relationships between entities (e.g., it is suspected that in money 
laundering, innocuous dealings/relationships are purposefully 
introduced to further obscure relationships [27]), whereas the 
focus of a conventional search may be to find predictable or 
commonly expected relationships for the purpose of validating 
or augmenting already known information. Therefore, in the
former scenario, we may need to boost results that are
considered unpredictable, whereas in the latter case we may need
to reverse that ordering.

The challenge here of course is in defining the metrics that 
will be used for determining the ordering of results. When using 
IR style techniques over structured or semi-structured data, the 
ordering of results is based on how close a document is to a 
query (i.e., an explicit description of a user’s need). In the
context of a relationship search, however, queries do not contain
such a description; they may just identify the entities of interest.
Consequently, a relevance model that is based on how well a
document matches a query does not apply, and the
development of novel techniques is necessary. One promising
approach to dealing with this problem is based on using metrics
that measure the predictability of the result that is
being returned. For example, we may choose to rank less
predictable results highest in an investigative or discovery
search, while in a conventional search the reverse ordering is
more desirable.


1.1 Outline and Contributions 

• In this paper, we focus on ranking the results of complex
relationship searches on the Semantic Web. We pursue 
an approach that is based on a modulative relevance 
model SemRank, that can easily (using a sliding bar) be 
modulated or adjusted via the query interface. In this 
way, a user can easily vary their search mode from a 
Conventional search mode to a Discovery search 
mode based on their need. The richness of the SemRank 
relevance model stems from the fact that it uses a blend 
of semantic and information theoretic techniques along 
with heuristics to determine the rank of Semantic 
Associations. It also has the advantage of being a unified 
model for ranking all the different types of Semantic 
Associations. 

• We discuss the infrastructure provided by the SSARK
(Semantic Searching of A different Kind) system for the 
computation of SemRank values and ordering of results 
based on these values. Some key components of the 
infrastructure include the idea of modulative searches 
used to support anywhere from conventional to discovery 
searches; the notion of Semantic Summaries 
analogous to the notion of structural summaries used for 
optimizing path expression query evaluation; a pipelined 
Top-K algorithm for computing approximately correct 
orderings of search results. 


The rest of the paper is organized as follows: Section 2 presents
some background information on Semantic Associations. Section
3 discusses the issue of ranking Semantic Associations and
presents the components of the SemRank model. Section 4
discusses the computational issues with respect to computing
SemRank values and result ordering and presents the strategies
adopted by the SSARK system. Section 5 presents an empirical
evaluation of our approach, and Sections 6 and 7 discuss related
work and conclusions, respectively.


2 BACKGROUND AND MOTIVATION 


Semantic Associations are based on the notion of RDF
Property Sequences whose instances can be viewed as
labeled paths in a knowledge base. A knowledge base in our
context refers to a set of RDF descriptions or OWL-Lite
descriptions represented in RDF. Semantic Associations also
include certain binary relations on Property Sequences that
capture the intersection and similarity of paths in which entities
are involved. Figure 1 shows some examples of Semantic
Associations. Figure 1(a) shows a direct path connecting
resources r1 and r2 and is called a p-pathAssociation.


(a) p-Path Association between a and b (b) p-Iso Association between a and b 


(c) p-Join Association between a and b 


Figure 1: Semantic Associations 


Figure 1(c) is called a p-joinAssociation between r1 and
r2. The rationale for suggesting a relationship between r1 and
r2 is that the paths $p_1 = p_{11}, p_{12}, \ldots, p_{1n}$ originating from r1 and
$p_2 = p_{21}, p_{22}, \ldots, p_{2m}$ originating from r2 join at node rn. In
other words, r1 and r2 meet at some point. Another type of
association called a p-isoAssociation (shown in Figure
1(b)) indicates a similarity relationship between resources. The
paths $p = p_1, p_2, \ldots, p_n$ originating from r1 and $p' = p_1', p_2',
\ldots, p_n'$ originating from r2 are semantically similar in that the
corresponding edges in both paths are related in a subproperty
relationship (i.e., for each $p_i$, either $p_i$ is rdfs:subPropertyOf
$p_i'$ or vice versa); therefore r1 and r2 are related by virtue of
this similarity.

As mentioned earlier, Semantic Associations are based on the 
notion of Property Sequences. Let us assume the following
interpretation functions for a class c and a property p,
$[\![\cdot]\!]^{\wedge}$, $[\![\cdot]\!]^{\uparrow}$ and $[\![\cdot]\!]$, such that:

i) $[\![c]\!]^{\wedge} = \{ r \mid r \text{ is rdf:type of } c \}$, i.e., only proper instances
of c.

ii) $[\![p]\!]^{\wedge} = \{ [r_1, r_2] \mid r_1 \in [\![c]\!],\ c \in p.\mathrm{domain};\ r_2 \in [\![c']\!],\ c' \in p.\mathrm{range} \}$,
i.e., only proper instances of p.

iii) $[\![c]\!]^{\uparrow} = [\![c]\!]^{\wedge} \cup [\![c']\!]^{\uparrow}$ where c is
rdfs:subClassOf c', i.e., proper instances of c and the
superclasses of c.

iv) $[\![p]\!]^{\uparrow} = [\![p]\!]^{\wedge} \cup [\![p']\!]^{\uparrow}$ where p is
rdfs:subPropertyOf p', i.e., proper instances of p and the
superproperties of p.

v) $[\![c]\!] = [\![c]\!]^{\wedge} \cup [\![c']\!]$ where c' is
rdfs:subClassOf c, i.e., all instances of c including its
subclasses.

vi) $[\![p]\!] = [\![p]\!]^{\wedge} \cup [\![p']\!]$ where p' is
rdfs:subPropertyOf p, i.e., all instances of p including its
subproperties.

A Property Sequence $PS = P_1, P_2, \ldots, P_n$ is a finite
sequence of properties whose interpretation is given by:

$$[\![PS]\!] \subseteq \times_{i=1}^{n} [\![P_i]\!]$$

such that

1. for ps an instance of PS, i.e., $ps \in [\![PS]\!]$, $ps[i] = [r_1, r_2] \in [\![P_i]\!]$
for $1 \le i \le n$. (We use the notation ps[i][0] and ps[i][1]
to refer to $r_1$ and $r_2$ respectively.)

2. ps[i][1] = ps[i+1][0].
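
Condition 2 above can be illustrated with a minimal sketch. It assumes each step of a path instance is stored as a (subject, object) pair of resource identifiers; the function name and the sample identifiers are hypothetical and not part of the original formalism.

```python
from typing import List, Tuple

# One step of a path instance: (ps[i][0], ps[i][1]) in the notation above.
Step = Tuple[str, str]


def is_connected_path(ps: List[Step]) -> bool:
    """Return True when the object of every step is the subject of the next step."""
    return all(ps[i][1] == ps[i + 1][0] for i in range(len(ps) - 1))


# Hypothetical resource identifiers, used only for illustration.
connected = [("r1", "r2"), ("r2", "r3"), ("r3", "r4")]
broken = [("r1", "r2"), ("r5", "r3")]

print(is_connected_path(connected))
print(is_connected_path(broken))
```

A sequence with zero or one step is trivially connected under this check.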

In the example above, Figure 1(a) shows a property sequence
whose instance is the directed path from r1 to r2. Figure 1(c)
shows two property sequences $p_{11}, p_{12}, \ldots, p_{1n}$ and $p_{21}, p_{22}, \ldots, p_{2m}$
that form a p-joinAssociation between r1 and r2 and are
called Joined Property Sequences, while the two
sequences of properties $p_{11}, p_{12}, \ldots, p_{1n}$ and $p_{21}, p_{22}, \ldots, p_{2n}$ in Figure
1(b) that form a p-isoAssociation between r1 and r2 are
called p-Isomorphic Property Sequences.


2.1 Motivating Example 
Figure 2: An example RDF knowledge base


We will motivate our work with a simple example. The top part
of Figure 2 shows four schemas for four different domains,
University, Banking, Flight and Organization,
while the bottom part shows a set of resources described using
those schemas, i.e., the knowledge base. A Semantic
Association search specifies a pair of resources and
optionally some keywords, and the result is the set of
p-pathAssociations, p-joinAssociations and p-isoAssociations that
relate both resources. A semantic association search is
sometimes referred to as a p-query because the search is done
using an operator p that extends traditional RDF query
languages [3][2]. Table 1 below shows three Semantic
Associations between resources r1 and r6, a student and her
professor. The first row contains a p-pathAssociation that
connects the student r1 to r6 via the purchase of a ticket that was
paid for by r6, and row 2 connects r1 directly to r6 with an
advisee relationship. Row 3 shows a p-isoAssociation in which
both r1 and r6 have relationships that are similar, i.e., stock
ownership. There are other semantic associations between r1
and r6 that are not shown in the table, e.g., the p-joinAssociation
involving the path in row 1 and the other paths that lead to r6.


Table 1: Example Semantic Associations


No.  Semantic Association
1    &r1 purchased → &r2 forFlight → &r3 paidBy → &r4
     accountHolder → &r5 leader → &r6
2    &r1 adviseeOf → &r6
3    &r1 ownsStockIn → &r5 isAccountHolderOf → &r4
     &r6 ownsStockIn → &r9 isAccountHolderOf → &r8


It is likely that the result of a search in a large knowledge base
will be overwhelming, especially if the search includes multiple
sources and destinations. This suggests a need to rank the results
in some order of importance. On the other hand, it is not very 
clear how to ascribe importance to the results because of their 
complex and heterogeneous nature. 

In the next section, we will discuss the relevance model called 
SemRank which addresses the problem of how to rank these 
results. 


3 RANKING SEMANTIC ASSOCIATIONS


It would appear that there are different possibilities for ranking 
semantic associations results e.g., by shortest paths, longest 
paths, least frequently occurring paths, etc. However, each of 
these approaches makes an assumption about what is most 
relevant for every situation. In our experience, we have found 
that different applications have different needs and making 
assumptions that fix the ranking schemes can be limiting. 
Consequently, two main features of our ranking scheme are 
customizability (allowing users to select an appropriate ranking 
scheme) and flexibility (allowing users to easily apply different 
ranking schemes on results so that results may be viewed using 
different perspectives). 

Fundamental to our ranking approach is the ability to measure 
how much information is conveyed by a result thereby giving a 
sense of how much information a user would gain by being 
informed about the existence of the result. This is closely related 
to the likelihood that a user could have guessed that such an 
association exists or the predictability of the association. 
Using such a measure we may then rank results based on the
search mode selected for the search. For example, if the context
requires a conventional search then results that are deemed
obscure and unpredictable will be ascribed the least importance.
In Table 1, for example, we can say that the fact that a student is
an advisee of a professor, i.e., result 2, is probably the least
surprising of the results and should be assigned the highest
relevance when a conventional search is being performed. On
the other hand, when a discovery search is performed, the other
two results are candidates for the most relevant result.

The factors used for measuring the predictability of a result 
include (i.) the uniqueness or specificity of the result and 
(ii.) how discrepant the structure of the result is from the 
possibilities that can be gleaned from the schema. In the case of 
specificity, it seems reasonable to conjecture that a commonly
occurring association is more predictable than a rarely occurring
association. The idea of uniqueness may be either with respect
to the whole database or just to the set of resources of similar 
types. For example, it may be that the association is rare with 
respect to the entire database but frequent when considering 
only the set involving similar resources. We combine both 
notions of uniqueness to form a measure of information content 
of a result. 

The issue of a result’s discrepancy from schema arises because
the multiple typing of resources by classes unrelated by
inheritance, which is allowed by the RDF and OWL-Lite data
models, often leads to paths at the data layer that cannot be
predicted by just looking at the schemas. For example in Figure 2,
the property sequence
purchased•forFlight•paidBy•accountholder•electedLeader
does not occur anywhere at the schema layer but
occurs in a path from r1 to r6 because r4 and r5 are multiply
classified. Deviations from schema represented paths are called
refractions, and paths with many refractions are unlikely to
be easily anticipated by users, making them less predictable.
Finally, we allow users to optionally specify some keywords that
capture relevance, and results which contain semantic matches
are ranked highest.

The rest of the section elaborates on these measures and how 
they are used to rank p-path associations. For brevity, we omit 
their application to other types of semantic associations. 


3.1 Information Gain and p-Path Semantic Associations

In information theory, the amount of information contained
in an event is measured by the negative logarithm of the
probability of occurrence of the event. Thus if $\chi$ is a discrete
random variable or an event that has possible outcome values $x_1,
x_2, \ldots, x_n$ occurring with probabilities $pr_1, pr_2, \ldots, pr_n$, i.e.,
$\Pr(\chi = x_i) = pr_i$ with $pr_i \ge 0$ and $\sum_i pr_i = 1$, the amount of
information gained or uncertainty removed by knowing that $\chi$
has the outcome $x_i$ is given by


$$I(\chi = x_i) = -\log pr_i$$


The maximum information content of $\chi$ is attained when $pr_i =
1/n$ for all i. This is given by $I(\chi) = \log n$.
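
As a small numerical illustration of these two formulas, the sketch below computes the information content of an outcome and the maximum for a uniform distribution. The logarithm base is not stated in the text, so base 2 is an assumption here, and the function names are illustrative.

```python
import math


def information(pr: float) -> float:
    """Information content -log(pr) of an outcome with probability pr (base 2 assumed)."""
    if not 0.0 < pr <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log2(pr)


def max_information(n: int) -> float:
    """Maximum information content log(n), attained when all n outcomes are equally likely."""
    return math.log2(n)


# A rare outcome carries more information than a common one.
rare = information(1 / 8)
common = information(1 / 2)
print(rare, common, max_information(8))
```

With eight equally likely outcomes, each outcome attains the maximum, so `information(1 / 8)` and `max_information(8)` agree.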


Based on this we can build a model for measuring the
information content of a semantic association by considering the
occurrence of an edge as an event and RDF properties as its
outcomes. We begin by defining the notion for a property and
then extend it to the sequence of properties of a path. Assume that
P is the set of all property types in a description base and $\chi$ is a
discrete random variable with sample space $[\![P]\!]$. Then for any
valid property $p \in P$, the probability that $\chi = p$ is given by


$$\Pr(\chi = p) = \frac{|[\![p]\!]^{\wedge}|}{|[\![P]\!]|}$$


We refer to this probability as the specificity SP of the
property p. The specificity of a property is a measure of its
uniqueness relative to all other properties in the description
base. The information content of the occurrence of a property p,
$I(\chi = p)$, in the description base due to its specificity is:


$$I_S(p) = I(\chi = p) = -\log \Pr(\chi = p)$$


It is possible to develop a similar measure which exploits the
semantics of RDF and RDFS. Given that RDF resources are
typed, any two resources r1 and r2 have a finite number of valid
properties that may connect them. Using this information,
we can estimate the information content of a property p linking
r1 and r2 with respect to only the valid properties as
possibilities. What we then expect is that information content 
will be larger in situations where the number of valid properties 
is large and smaller in situations with fewer valid properties. 
The valid properties that can link two resources include those 
that have been explicitly defined in a schema and those that may 
be inferred from the semantics of RDFS. In particular, the 
semantics of multiple domains/ranges on a property p implies 
that a resource must belong to all the domain/range classes even 
if that membership is not explicitly stated. We introduce the 
concept of a Representative Ontology Class (ROC) 
as a concise summary of related classes by virtue of the 
equivalence of their interpretations. For example, in Figure 2 
the classes Book and Ticket belong to an ROC which 
represents the set of things that can be purchased. We now 
make this notion more precise. 
Definition 1 Representative Ontology Classes. For an
ontology O with the set of classes C and properties P and |C| = n,
let S be an n x n matrix with the following entries:

$$S_{ij} = \bigcup \{\, p \mid c_i \in domain^{\wedge}(p) \wedge c_j \in range^{\wedge}(p) \,\}$$


where domain^/range^ refer to classes in the proper
domain/range of a property (i.e., excluding their subclasses).
We can define an equivalence relation ~ such that:

~(i, j) iff $S_{ik} = S_{jk}$ and $S_{ki} = S_{kj}$ for all k.

~ partitions the elements of S into the set $\mathcal{L}$ of equivalence
classes of ~, where each equivalence class represents the set of
classes that are equivalent with respect to their outgoing and
incoming properties. It corresponds to the set of classes that
should have the same interpretations. Each $l \in \mathcal{L}$ is called a
Representative Ontology Class (ROC). The set of
all possible properties semLinks that can directly connect two
ROCs X and Y is given by $semLinks(X, Y) = S_{i(X)\,i(Y)}$
where $i(l) \in \{ i : C_i \in l \}$. $i(l)$ is called a
representative for the ROC l and $C_i$ is called a member
of l. Since the members of each equivalence class are
equivalent with respect to their outgoing and incoming
properties, it suffices to pick a representative class for X
and Y, say $C_{i(X)}$ and $C_{i(Y)}$ respectively.
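
Definition 1 can be sketched in a few lines of code. The following is a minimal illustration, not the SSARK implementation: the schema is a hypothetical toy one (the `teaches` property and the exact domains and ranges are assumptions), and classes are grouped by comparing their rows and columns of the matrix S.

```python
from collections import defaultdict


def representative_classes(classes, properties):
    """Group classes into ROCs: classes with identical rows and columns of S.

    `properties` maps a property name to (proper domain classes, proper range classes).
    """
    index = {c: i for i, c in enumerate(classes)}
    n = len(classes)
    # S[i][j] holds the properties whose proper domain contains class i
    # and whose proper range contains class j.
    S = [[frozenset() for _ in range(n)] for _ in range(n)]
    for name, (domain, range_) in properties.items():
        for c in domain:
            for d in range_:
                i, j = index[c], index[d]
                S[i][j] = S[i][j] | {name}
    groups = defaultdict(list)
    for c in classes:
        i = index[c]
        # Row i = outgoing properties, column i = incoming properties.
        signature = (tuple(S[i]), tuple(S[k][i] for k in range(n)))
        groups[signature].append(c)
    return [frozenset(g) for g in groups.values()], S, index


# Hypothetical toy schema, loosely modeled on the Book/Ticket example.
classes = ["Student", "Passenger", "Book", "Ticket", "Professor", "Course"]
properties = {
    "purchased": ({"Student", "Passenger"}, {"Book", "Ticket"}),
    "teaches": ({"Professor"}, {"Course"}),
}
rocs, S, index = representative_classes(classes, properties)
# semLinks between two ROCs, read off via one representative of each.
sem_links = S[index["Student"]][index["Ticket"]]
print(rocs, sem_links)
```

Because members of an ROC share all rows and columns, any member can serve as the representative when reading semLinks from S.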
Now, given any two resources r1 and r2, we can measure the
probability distribution of the types of properties that can
connect them in the instance base. If we let $\pi$ be the set of all
possible properties that may connect r1 and r2, then $\pi$ clearly
depends on the types of r1 and r2. If $C_1, C_2, \ldots, C_m$ and $D_1, \ldots,
D_n$ are the classes of r1 and r2 respectively and if $X_1, X_2, \ldots, X_m$
and $Y_1, Y_2, \ldots, Y_n$ are the ROCs that $C_i$ and $D_j$ belong to respectively,
then


$$\pi = \bigcup_{i,j} semLinks(X_i, Y_j) \quad \text{and} \quad \theta = \bigcup_{p \in \pi} [\![p]\!]^{\wedge}$$


$\theta$ represents the interpretation of all the valid properties that may
connect r1 and r2. Thus, the probability that $\chi \in \theta$ is given by:


$$\Pr(\chi \in \theta) = \frac{|\theta|}{|[\![P]\!]|}$$


For a given valid property p in the description base, if $\chi \in \theta$, the
probability that $\chi = p$ is given by:


$$\Pr(\chi = p \mid \chi \in \theta) = \frac{\Pr(\chi = p,\ \chi \in \theta)}{\Pr(\chi \in \theta)} = \frac{|[\![p]\!]^{\wedge}|}{|\theta|}$$

We refer to this probability as the θ-Specificity $SP_\theta$ of
property p. The θ-specificity of a property is a measure of its
uniqueness relative to all other properties in the description base
whose domain and range belong to the same ROCs,
respectively. The information content of the occurrence of a
valid property p, $I(\chi = p \mid \chi \in \theta)$, in the description base due to its
θ-Specificity can then be defined as


$$I_{\theta\text{-}S}(p) = I(\chi = p \mid \chi \in \theta) = -\log \Pr(\chi = p \mid \chi \in \theta)$$


To illustrate these concepts, for the property “purchased”
connecting &r1 and &r2 in Figure 2 above, $ROC_1$ = {Student,
Passenger}, $ROC_2$ = {Book, Ticket} with $\theta = [\![purchased]\!]^{\wedge} \cup
[\![acquired]\!]^{\wedge} \cup [\![bidsFor]\!]^{\wedge}$, so that the size of $\theta$ is
$|[\![purchased]\!]^{\wedge}| + |[\![acquired]\!]^{\wedge}| + |[\![bidsFor]\!]^{\wedge}|$. If
there are 20, 40, 80 instances of the properties purchased,
acquired and bidsFor respectively, with a total of 1000
property instances in the description base, then the specificity of
purchased is 0.02 while its θ-specificity is 0.143.
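
The arithmetic of this example can be reproduced with a short sketch. The instance counts are the ones stated above; the function names, and the use of base 2 for the logarithm, are assumptions made for illustration.

```python
import math

# Instance counts from the example above.
counts = {"purchased": 20, "acquired": 40, "bidsFor": 80}
total_instances = 1000  # all property instances in the description base


def specificity(p, counts, total):
    """Share of all property instances that are instances of p."""
    return counts[p] / total


def theta_specificity(p, valid, counts):
    """Share of instances of p among instances of the valid properties only."""
    return counts[p] / sum(counts[q] for q in valid)


sp = specificity("purchased", counts, total_instances)
sp_theta = theta_specificity("purchased", ["purchased", "acquired", "bidsFor"], counts)

# Information content due to each measure (base-2 logarithm assumed).
info_s = -math.log2(sp)
info_theta = -math.log2(sp_theta)
print(sp, round(sp_theta, 3), info_s, info_theta)
```

Restricting the sample space to the valid properties raises the probability of `purchased` from 20/1000 to 20/140, so its information content with respect to θ is lower than with respect to the whole description base.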

We can extend these ideas to capture the information content of
a p-path association. Let $PS = p_1, p_2, \ldots, p_n$ be a property
sequence and $ps \in [\![PS]\!]$ a path. It is clear that ps occurs as
frequently as the least frequently occurring property $p_i$ in PS. We
define the information content of ps due to its specificity as
$I_S(ps) = \max_i \{ I_S(p_i) \}$. Intuitively, this means that a path is as
informative as its most informative edge with respect to the
entire description base.


For the information content of ps due to its θ-specificities we must
somehow combine the different θ-specificities of the various
properties on the path. However, since the θ-specificities of the
edges are based on different probability distributions, we cannot
meaningfully compare them with one another, so we must first
normalize the values by taking the ratio of the observed
information content to the maximum possible information
content (as described earlier). This results in normalized
θ-specificity values, $NI_{\theta\text{-}S}$. It is important that our combination
function does not bias towards longer paths; therefore a sum
function is not a good combination function. Also, we must
ensure that the combination function distinguishes between
paths with more uniform θ-specificity distributions and those
with non-uniform θ-specificity distributions. To see why this is
so, take for example two paths $ps_1$ and $ps_2$ that have the same
average $NI_{\theta\text{-}S}$, but $ps_1$ has θ-specificity values for all its edges
about equal while $ps_2$ has a range of θ-specificity values for its
edges from low to high. Then it seems that $ps_2$, which has some
edges with low information content (i.e., some weak links),
should be easier to predict than $ps_1$, which has all its edges
with an equal level of predictability. Therefore, to measure the
information content of ps with respect to the θ-specificity of the
properties $p_1, p_2, \ldots, p_n$, we modify the value obtained from the
average of the θ-specificities to:


$$I_{\theta\text{-}S}(ps) = \min_{\forall i}\{NI_{\theta\text{-}S}(p_i)\} + \frac{\left[\sum_{i=1}^{n} NI_{\theta\text{-}S}(p_i)\right] - \min_{\forall i}\{NI_{\theta\text{-}S}(p_i)\}}{n-1}$$

This implies that information content due to θ-specificity is a
combination of the information gained from the weakest point
along the path and an average of the rest.


Finally, to get the total information gained by knowing that a
path occurred, we combine the values of information content due
to both specificity and θ-specificity:


$$I(ps) = I_S(ps) + I_{\theta\text{-}S}(ps)$$
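
The path-level measures can be sketched as follows. This is an illustrative reading of the formulas above, taking per-edge values as inputs that are assumed to have been computed already; the sample values are made up, and returning the single edge value for a one-edge path is an assumption, since the text does not cover that case.

```python
import math


def path_specificity_info(edge_info):
    """I_S(ps): the path is as informative as its most informative edge."""
    return max(edge_info)


def path_theta_info(normalized_edge_info):
    """Weakest edge plus the average of the remaining edges (normalized values)."""
    n = len(normalized_edge_info)
    weakest = min(normalized_edge_info)
    if n == 1:
        # Assumption: the text leaves the one-edge case open.
        return weakest
    return weakest + (sum(normalized_edge_info) - weakest) / (n - 1)


def total_info(edge_info, normalized_edge_info):
    """I(ps) = I_S(ps) + I_theta-S(ps)."""
    return path_specificity_info(edge_info) + path_theta_info(normalized_edge_info)


# Two hypothetical paths with the same average normalized value (0.5).
uniform = [0.5, 0.5, 0.5]
skewed = [0.1, 0.5, 0.9]
print(path_theta_info(uniform), path_theta_info(skewed))
```

Although both sample paths have the same average, the path with a weak link receives the lower value, which matches the intent that such a path is easier to predict.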


3.2 Refraction 

As mentioned earlier, the multiple classification of resources 
allowed in the Semantic Web data models like RDF can create 
paths at the description layer that do not occur at the schema
layer, especially when multiple schemas are used to describe a 
set of resources. We use the term Refraction to refer to a 
deviation from a path’s representation at the schema layer. In 
other words, a description layer path starts and proceeds along 
exactly as described at the schema layer and then changes 
direction or refracts at some point. 

In order to make this notion of refraction more precise, we need 
a representation of all the paths that are possible based on what 
is explicitly defined in or inferable from a schema. Then for a 
given path in the description base, any sequence of edges not 
represented would be considered a refraction due to a multiple 
classification of a node. We propose the notion of a Semantic 
Summary as such a representation. It is analogous to the 
concept of DataGuides [10] and other structural summaries [17] 
used to optimize the evaluation of path expression queries in 
semi-structured data models. A semantic summary is a graph in 
which the vertices are ROCs. 

Definition 2 Semantic Summaries. A Semantic Summary
for an ontology O with sets C/P of classes/properties is a graph
$G_S = (V_S, E_S, \lambda, <)$ where

1. $V_S$ is the set of ROCs of O as defined in Definition 1.

2. $E_S = \{ (x, y) \mid S_{i(x)\,i(y)} \neq \emptyset \text{ and } x, y \in V_S \}$

3. $\lambda : E_S \rightarrow 2^{P}$ such that for $(x, y) \in E_S$,
$\lambda(x, y) = semLinks(x, y)$.

4. < is a subsumption relation on nodes in $V_S$ such that for
two ROCs x and y, x is rdfs:subClassOf y if for $c_i \in x$, $\exists c_j
\in y$ such that $c_i$ rdfs:subClassOf $c_j$.

Two edges (u, v) and (w, x) of a semantic summary are said to
be adjacent if v = w. Given a semantic summary $G_S = (V_S,
E_S, \lambda, <)$ and a property sequence $PS = p_1, p_2, \ldots, p_n$, if $e_i, e_j
\in E_S$ and $e_i$ and $e_j$ are not adjacent in $G_S$ and $p_i \in \lambda(e_i)$ and
$p_{i+1} \in \lambda(e_j)$, then we say that there is a refraction from $p_i$ to $p_{i+1}$.

Formally, for a property sequence $PS = p_1, p_2, \ldots, p_n$,

$$refraction(p_i, p_{i+1}) = \begin{cases} 1 & \text{if } \nexists\, e_i, e_j \in E_S \text{ such that } e_i \text{ is adjacent to } e_j \wedge p_i \in \lambda(e_i) \wedge p_{i+1} \in \lambda(e_j) \\ 0 & \text{otherwise} \end{cases}$$


We use the term refraction count RC to refer to the
number of refractions on a path, given by


$$RC(PS) = \sum_{i=1}^{n-1} refraction(p_i, p_{i+1}) \text{ for } n \ge 2,\ 0 \text{ otherwise.}$$
For example, in Figure 2, the path ps =
&r1 depositsInto → &r8 accountHolder → &r9 electedLeader → &r6
with the property sequence PS =
depositsInto•accountholder•electedLeader
refracts from accountholder to electedLeader because the
resource &r9 is multiply classified as an instance of both
Organization and Customer, and RC(PS) = 1.
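
The refraction count for this example can be sketched as below. The semantic summary is a hypothetical stand-in for the one derived from Figure 2: the ROC names and edge labels are assumptions chosen only so that accountholder and electedLeader lie on non-adjacent summary edges.

```python
def refraction(p, q, summary):
    """Return 0 if p and q label two adjacent summary edges, 1 otherwise.

    `summary` maps a summary edge (source ROC, target ROC) to its set of property labels.
    Edges (u, v) and (w, x) are adjacent when v == w.
    """
    for (u, v), labels_uv in summary.items():
        if p not in labels_uv:
            continue
        for (w, x), labels_wx in summary.items():
            if v == w and q in labels_wx:
                return 0
    return 1


def refraction_count(property_sequence, summary):
    """RC(PS): number of refractions between consecutive properties (0 when n < 2)."""
    return sum(
        refraction(property_sequence[i], property_sequence[i + 1], summary)
        for i in range(len(property_sequence) - 1)
    )


# Hypothetical summary edges; the ROC names are illustrative only.
summary = {
    ("Customer", "Account"): {"depositsInto"},
    ("Account", "Customer"): {"accountholder"},
    ("Organization", "Person"): {"electedLeader"},
}
rc = refraction_count(["depositsInto", "accountholder", "electedLeader"], summary)
print(rc)
```

In this toy summary no edge labeled electedLeader starts where an accountholder edge ends, so the single refraction arises exactly at that step.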


3.3 S-Match 


In order to integrate IR style search with p-queries, we allow
users to augment their queries with keywords. A Semantic
Match (a match of a keyword with a property, or with a
super/subproperty of a property, occurring in a Semantic
Association) increases the rank value for that Semantic
Association. The degree of the match, and hence its S-Match
value, is determined by the proximity of the properties in the
property hierarchy. This approach is very similar to that used in
determining the similarity of concepts in an ontology [16][21].
Given a property sequence $PS = p_1, p_2, p_3, \ldots, p_n$ and a set of
keywords $K = \{k_1, k_2, k_3, \ldots, k_m\}$, the degree of a match between
$k_i$ and $p_j$ is given by $SemMatch(k_i, p_j) = (2^d)^{-1}$, with
$0 < (2^d)^{-1} \le 1$, where d is the minimum
distance between the properties in a property hierarchy. If two
keywords match on the same property we take the maximum of
their SemMatch values. Then for a path $ps \in [\![PS]\!]$, its
S-Match value is given by


$$S\text{-}Match(ps) = \sum_{i=1}^{n} \max_{j=1}^{m} \{ SemMatch(k_j, p_i) \}$$

For example, in Figure 2 above, the minimum distance between
“audits” and “enrolls” is 1, so that SemMatch(audits,
enrolls) = ½. Given K = {audits, taughtBy} and a property
sequence PS = enrolls•taughtBy, S-Match(ps) = ½ + 1 =
1½.
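
This example can be checked with a short sketch. The distance table is a hypothetical stand-in for a lookup in the property hierarchy, and treating a keyword and property with no hierarchical relationship as contributing 0 is an assumption not stated in the text.

```python
# Hypothetical hierarchy distances between distinct, related properties.
DISTANCES = {frozenset({"audits", "enrolls"}): 1}


def hierarchy_distance(a, b):
    """Minimum distance in the property hierarchy, or None if unrelated (assumed)."""
    if a == b:
        return 0
    return DISTANCES.get(frozenset({a, b}))


def sem_match(keyword, prop):
    """SemMatch = (2^d)^-1; unrelated pairs contribute 0 by assumption."""
    d = hierarchy_distance(keyword, prop)
    return 0.0 if d is None else 1.0 / (2 ** d)


def s_match(property_sequence, keywords):
    """Sum over properties of the best SemMatch value among all keywords."""
    return sum(
        max((sem_match(k, p) for k in keywords), default=0.0)
        for p in property_sequence
    )


value = s_match(["enrolls", "taughtBy"], ["audits", "taughtBy"])
print(value)
```

Taking the maximum per property means that several keywords matching the same property do not inflate the score.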


3.4 SemRank 


All the factors discussed above when combined together give 
the SemRank value of a Semantic Association. However, the 
exact nature in which they are combined is dependent on the 
search mode μ, which varies from 0 to 1, with 0 indicating purely
Conventional and 1 indicating purely Discovery modes,
respectively. Based on this, we build a modulative model for
SemRank in which the mode μ specifies how each of the factors
contributes to the rank of an association. The query mode μ
modulates the contribution of the information content of an
association to its rank as shown below:


$$I_\mu(ps) = (1 - \mu)\,(I(ps))^{-1} + \mu\, I(ps)$$

This leads to higher rank values being assigned to the most
unpredictable paths at the purely discovery mode, and lower
rank values being assigned at the purely conventional mode. The
query mode μ modulates the refraction count of an association as
shown below:

$$RC_\mu(ps) = \mu\, RC(ps)$$


Since predictable paths are desired at the purely conventional 
mode, paths ranked highest at this mode do not have refractions, 
as the formula above shows. Both the purely discovery and 
purely conventional modes seek to retrieve paths whose 
component properties have high S-Match values with the 
keywords provided by the user. Therefore, S-Match is not
modulated by μ. The SemRank formula combines these three
factors to assign a rank to any Semantic Association, adapting
itself as the mode changes. It is defined for a p-path association
ps as:


$$SEMRANK(ps) = I_\mu(ps) \times (1 + RC_\mu(ps)) \times (1 + S\text{-}Match(ps))$$
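
A minimal sketch of how the mode changes the ordering is given below. It assumes the modulated information content takes the form (1 − μ)/I(ps) + μ·I(ps), that I(ps) is strictly positive, and that the three factors have already been computed; the function name and the sample values are illustrative.

```python
def sem_rank(info, refraction_count, s_match, mu):
    """Combine information content, refraction count and S-Match for search mode mu."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    if info <= 0.0:
        # Assumption: the inverse term requires a strictly positive I(ps).
        raise ValueError("information content must be positive")
    info_mu = (1.0 - mu) / info + mu * info   # modulated information content
    rc_mu = mu * refraction_count             # modulated refraction count
    return info_mu * (1.0 + rc_mu) * (1.0 + s_match)


# Two hypothetical associations: one predictable, one obscure with a refraction.
predictable = {"info": 0.5, "refraction_count": 0, "s_match": 0.0}
obscure = {"info": 4.0, "refraction_count": 1, "s_match": 0.0}

conventional = (sem_rank(mu=0.0, **predictable), sem_rank(mu=0.0, **obscure))
discovery = (sem_rank(mu=1.0, **predictable), sem_rank(mu=1.0, **obscure))
print(conventional, discovery)
```

With these sample values the predictable association ranks higher at μ = 0 and the obscure one ranks higher at μ = 1, which is the reversal that modulative search is meant to provide.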


Using this model, we can provide a flexible ranking approach 
for ranking complex relationships. 


4 ORDERING SEARCH RESULTS USING SEMRANK VALUES


The approach for obtaining an ordering on Semantic 
Associations resulting from a search will depend largely on the 
strategy for computing SemRank values. Possible strategies 
include integrating query processing, SemRank computation and 
result ordering into a single phase or performing the last two 
steps in a separate phase after query evaluation. The choice of 
the strategy to be adopted is dependent on whether exact 
orderings are required or whether approximately correct 
orderings are acceptable. In the case of approximately correct
orderings, we trade correctness for efficiency. This trade-off arises
because exact orderings may require
completely computing the SemRank values of all Semantic
Associations and then sorting them. When the result set is
large, this may prove to be inefficient.


In this section, we will discuss an approach for computing 
SemRank values for Semantic Associations and an approximate 
Top-K ordering algorithm used in our SSARK system. 


4.1 Overview of the SSARK System 

The approach used in the SSARK prototype system 
implementation consists of three phases supported by the 
architecture shown in Figure 3. 


Figure 3: Architecture of the SSARK system (components shown: User SubSystem with Query & Result Interface; RDF Documents; Loader; Preprocessor; Storage Manager; Index Manager with PHIX and ROIX; Ranking Engine; pipelined top-k results)


In the preprocessing phase, RDF documents are loaded 
and preprocessed into an intermediate representation by the 
Loader and Preprocessor. The intermediate 
representation of an RDF graph produced by the preprocessor is 
called a path sequence. A path sequence is a sequence of 
subgraph representations that is amenable to efficient query 
evaluation. The persistence of a path sequence is managed by 
the Storage Manager which allows for its storage in a 
database. When a query with a pair of resources is given, the 
Query Processor selects relevant subsequences of a path 
sequence and composes them to generate an annotated summary 
of the Semantic Associations called an Annotated Path 
Expression Tree (APET). The discussion of query
evaluation is outside the scope of this paper. For the sake of
brevity, the rest of the discussion will focus on ranking only
p-path associations, even though the other types of associations
have also been investigated and are being prototyped.


The APET generated by the query processor is a tree 
representation of the regular expression that represents all paths 
found between the resources specified in the query. It is a K-ary
and-or tree where leaves are the labeled edges of the
paths and internal nodes are operator (union, concatenation)
nodes with K children. Figure 4 shows an example APET. A
semantic transformation process is performed during query
evaluation to ensure that cycles are not represented in an APET.
A discussion of the semantic transformation process is outside
the scope of this paper but can be summarized as follows: any
cycle c with paths (a) from vertex v1 to vertex v2 and (b) from
v2 back to v1 is broken up into two paths from v1 to v2.
The first path is equivalent to (a) and the second is equivalent to
the reverse of path (b), i.e., (b) traversed from v1 to v2 with the
properties in (b) substituted with their inverse properties. Also
during query evaluation, the leaves of an APET (i.e., path edges)
are annotated with their SemRank values. Then, during the
ranking phase, the Ranking Engine uses a pipelined Top-K
algorithm to extract approximately the Top-K paths represented
in the summary. The remainder of this section elaborates on the
structures used to support the SemRank computation as well as
the Top-K algorithm.
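
The and-or structure described above can be sketched as a small Python data structure. This is a minimal sketch under stated assumptions: the class name `Node`, the `kind` strings and the function `paths` are hypothetical, annotations are omitted, and the exhaustive enumeration shown here is only meant to make the semantics of union and concatenation nodes concrete (it is exactly the exponential expansion that the Top-K algorithm avoids).

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List


@dataclass
class Node:
    kind: str                      # "leaf", "or" (union) or "and" (concatenation)
    label: str = ""                # edge label, used by leaves only
    children: List["Node"] = field(default_factory=list)


def paths(node: Node) -> List[List[str]]:
    """Enumerate every path (as a list of edge labels) represented by the tree."""
    if node.kind == "leaf":
        return [[node.label]]
    child_paths = [paths(child) for child in node.children]
    if node.kind == "or":
        # Union: any path of any child is a path of this node.
        return [p for group in child_paths for p in group]
    if node.kind == "and":
        # Concatenation: one path from each child, in child order.
        return [sum(combo, []) for combo in product(*child_paths)]
    raise ValueError(f"unknown node kind: {node.kind}")


# Hypothetical tree for the expression (a | b) . c
tree = Node("and", children=[
    Node("or", children=[Node("leaf", "a"), Node("leaf", "b")]),
    Node("leaf", "c"),
])
all_paths = paths(tree)
```

For the hypothetical expression (a | b) . c the enumeration yields the two paths a.c and b.c.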


4.2 Annotating Path Expression Trees 

During query evaluation, the leaves of an APET are annotated 
with a set of values that contribute to their SemRank values. 
These values are either retrieved directly or computed from 
indexes in the Index subsystem. We will now summarize 
the roles of the indexes used in the computation of SemRank 
values: 


• Frequency Distribution IndeX (FDIX): FDIX
maps each property p to a tuple (|[[p]]|, |[[p]]"|) where
|[[p]]| is the size of p’s proper extent and |[[p]]"| includes
the size of the proper extent of p’s superproperties. These
values are used for calculating Specificity and 0-Specificity.


• Representative Ontology IndeX (ROIX): The
Representative Ontology Index is a hierarchical index that 
maps resources to classes and then classes to ROCs. It also 
stores the semLinks that link the ROCs, i.e., the labels on the 
edges linking the ROCs in the semantic summaries. This 
information is used to determine the refraction count of a path. 




• Property Hierarchy IndeX (PHIX): Each property
in the RDF data model may participate in a number of 
subsumption hierarchies. The idea is to index these properties 
in such a way that the distance between two entities in the 
hierarchy can be measured in constant time. To this effect, we 
index the properties in each hierarchy in a manner similar to 
the Dewey Decimal Coding (DDC) where each node is 
assigned an id that preserves its relative position amongst its 
siblings, prefixed by the id of its parent. For all the ids of all 
properties to be unique, each hierarchy is assigned a hierarchy 
id. Determining the distance between two properties then 
amounts to summing up the number of strings in the two ids 
beyond their Least Common Ancestor. For example given the 
ids 0.1.3.4.5.6 and 0.1.3.4.8, their LCA is 0.1.3.4. Beyond 
this, the first id has two strings (5.6) and the second has one 
(8), so the distance between them is three. Because a property 
may have more than one id, (since it may participate in more 
than one subsumption hierarchy), PHIX maps every property 
p to a set of ids. Using this index we can efficiently measure
the distance between a keyword given in a query and the
properties on the resulting path, which determines the
S-Match value of the path.
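
The distance computation on Dewey-style ids can be sketched as follows. This is a minimal sketch, not the PHIX implementation: it assumes that ids are dot-separated strings whose first component is the hierarchy id, and that the distance between two properties with several ids is the smallest distance over id pairs from the same hierarchy. The function names are hypothetical. The loop is bounded by the depth of the hierarchy rather than by its size.

```python
from typing import Iterable, Optional


def dewey_distance(id_a: str, id_b: str) -> int:
    """Number of id components beyond the Least Common Ancestor, summed over both ids."""
    a, b = id_a.split("."), id_b.split(".")
    if a[0] != b[0]:
        raise ValueError("ids belong to different hierarchies")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def property_distance(ids_a: Iterable[str], ids_b: Iterable[str]) -> Optional[int]:
    """Smallest distance over id pairs sharing a hierarchy; None if there is none."""
    ids_b = list(ids_b)
    best = None
    for ia in ids_a:
        for ib in ids_b:
            if ia.split(".")[0] != ib.split(".")[0]:
                continue  # ids from different hierarchies are not comparable
            d = dewey_distance(ia, ib)
            if best is None or d < best:
                best = d
    return best


# The example from the text: LCA is 0.1.3.4, leaving (5.6) and (8).
example = dewey_distance("0.1.3.4.5.6", "0.1.3.4.8")
```

For the ids used in the text, the sketch returns a distance of three.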


4.3 Retrieving Top-K results 

After the query evaluation has returned an APET, the ranking 
engine extracts the top-K paths represented in the APET. An 
exact SemRank ordering may require an exponential time 
algorithm since all paths must be assigned a value first before 
the paths are ordered. Here, we use a practical approach that 
finds an approximately correct ordering thereby 
sacrificing total correctness for efficiency. The algorithm 
proceeds in two phases. In the first phase, the top-K paths are 
computed based on all values except the refraction count values. 
Then the second phase reranks the paths from the first phase 
based on their refraction count. It is clear that this will not
always result in the totally correct SemRank ordering, but we
expect that what we get is an approximation that is suitable for
most applications. The reason for the two-phase design is that during
the Top-K computation, as will be discussed shortly, paths are composed
iteratively from subpaths in a bottom-up manner from the leaves,
so that the entire path is composed only when the iteration reaches the
root of the APET. This means that properties of a path that are
used in SemRank computation, such as the refraction count, can
only be computed at the end of the path-building phase, as opposed
to the other factors (e.g. specificity) that are properties of the
edges themselves and are known once an edge is encountered.


The algorithm proceeds bottom-up, computing the Top-K paths for
each node based on the Top-K paths computed for its children. Each
non-leaf node maintains a list for storing its Top-K paths, as
well as a max-priority queue implemented using the heap data
structure with which it orders its Top-K paths. The idea is to
obtain the Top-K paths for each node by accessing only a
minimal prefix of the Top-K paths of its children nodes, which
are ordered in non-increasing order of SemRank values. At an
‘or’ node, during the first iteration of the algorithm, the first
path from each child’s Top-K list is enqueued into the
max-priority queue, then k paths are extracted from the queue. For
any path extracted from the queue, the next path from its list is
enqueued. For subsequent iterations, we update the queue with
the first new entry of a list if it was updated after its previously
last entry had been enqueued. As such, not more than k paths are
accessed from each child’s list during each iteration of the
algorithm. On the other hand, at an ‘and’ node, during the first
iteration of the algorithm, the first path from each list is
concatenated (preserving the order of the lists) to obtain the first
of the k paths. To obtain the remaining k-1 paths, we first
initialize the queue by obtaining and enqueueing a
concatenation of the second path from each list with the first
path from all other lists, then we extract k-1 paths from the
queue. For each path $p_i$ extracted from the queue, if $p_i$ is a
concatenation of paths $l_{j1}, l_{j2}, \ldots, l_{jk}$ from lists $l_1, l_2, \ldots, l_k$
respectively (where $l_{jk}$ refers to the jth path from list k), we
create a new path $p_{i,m}$ by concatenating paths
$l'_{j1}, l'_{j2}, \ldots, l'_{jk}$ (where $l'_{jk}$ is equal to $l_{(j+1)k}$ if k
equals m, and $l_{jk}$ if k is not equal to m). Having created the
path $p_{i,m}$, we only insert it into the queue if


• there does not exist any path $p_j = p_{i,n}$ with $l'_{jk}$ equal to
$l_{jk}$ when k equals m that is still in the queue and


• $p_{i,m}$ is not already in the queue.


For other iterations, the queue may need to be re-initialized if it 
is empty. This algorithm is analogous to the ranked join 
algorithms described in [19] except that we have taken some 
measures to optimize queue operations. All possible paths have 
been extracted from the queue of both the ‘and’ and ‘or’ nodes 
when these queues become empty. In general, this technique can 
be applied to retrieve ranked paths from any tree representation 
of path expressions, irrespective of the relevance model used for 
ordering, as long as a monotone combining function is used in 
the join step. In the second phase, the ranks of the Top-K paths
retrieved from the first phase are recomputed, this time
including the refraction count of the paths, then the paths are
reordered based on the new rank values.

To illustrate this, suppose we want to retrieve the top-2 paths 
from the APET shown in Figure 4(a) using the approximate 
retrieval technique. 


Figure 4: APET showing Top-K evaluation 
For this example, we obtain the rank of a path by summing the
ranks of its sub-paths. Figure 4(b) shows the state of the APET
after all sub-trees of height 1 have been processed. Processing
continues in a similar manner until the list of the root node of
the APET gets updated with the Top-K paths. Figure 4(e) shows
the state of the APET after the first ranking phase. The top-2
paths c.f and g.i both have the same rank (9). If we assume that
there is refraction from g to i and none along c.f, then during the
second ranking phase, the ranks of these paths would be
recomputed to incorporate refraction. The new rank of g.i would
then be 18 (9 x 2), so that after they are reordered, g.i precedes
c.f.
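
The two phases can be sketched in Python as follows. This is a simplified sketch rather than the SSARK algorithm: it uses summation as the monotone combining function (as in the example above), recomputes each node's list in one pass instead of pipelining across iterations, and replaces the queue-insertion conditions with a plain visited set. All names and the leaf ranks in the usage lines are hypothetical; only the second-phase values (9, 9 and 18) come from the example.

```python
import heapq
from typing import Dict, List, Tuple

Ranked = Tuple[float, str]  # (rank, path); lists are sorted by non-increasing rank


def or_top_k(children: List[List[Ranked]], k: int) -> List[Ranked]:
    """Union node: k-way merge that reads only a prefix of each child's list."""
    heap = [(-lst[0][0], i, 0) for i, lst in enumerate(children) if lst]
    heapq.heapify(heap)
    out: List[Ranked] = []
    while heap and len(out) < k:
        _, i, j = heapq.heappop(heap)
        out.append(children[i][j])
        if j + 1 < len(children[i]):
            # Enqueue the next path from the list the extracted path came from.
            heapq.heappush(heap, (-children[i][j + 1][0], i, j + 1))
    return out


def and_top_k(children: List[List[Ranked]], k: int) -> List[Ranked]:
    """Concatenation node: ranked join, combining ranks by summation."""
    if any(not lst for lst in children):
        return []

    def score(idx: Tuple[int, ...]) -> float:
        return sum(children[c][j][0] for c, j in enumerate(idx))

    start = tuple(0 for _ in children)
    heap = [(-score(start), start)]
    seen = {start}
    out: List[Ranked] = []
    while heap and len(out) < k:
        neg, idx = heapq.heappop(heap)
        path = ".".join(children[c][j][1] for c, j in enumerate(idx))
        out.append((-neg, path))
        for m in range(len(idx)):  # advance one list at a time
            nxt = idx[:m] + (idx[m] + 1,) + idx[m + 1:]
            if nxt[m] < len(children[m]) and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (-score(nxt), nxt))
    return out


def rerank(paths: List[Ranked], refraction: Dict[str, int], mu: float = 1.0) -> List[Ranked]:
    """Phase 2: multiply each phase-1 rank by (1 + mu * RC) and re-sort."""
    rescored = [(r * (1 + mu * refraction.get(p, 0)), p) for r, p in paths]
    return sorted(rescored, key=lambda rp: -rp[0])


# Hypothetical leaf ranks: c = 5, d = 3, f = 4.
merged = or_top_k([[(5, "c")], [(3, "d")]], k=2)
joined = and_top_k([merged, [(4, "f")]], k=2)

# Second phase of the example: both paths have rank 9, one refraction on g.i.
final = rerank([(9, "c.f"), (9, "g.i")], {"g.i": 1})
```

Because refraction is applied only after the first phase has truncated each list to k entries, a path with a low first-phase rank but a high refraction count can be missed, which is the source of the approximation.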


Given an APET, let $R = p_1, p_2, \ldots, p_n$ be all paths in the APET in
non-increasing order of SemRank and let $R' = p'_1, p'_2, \ldots, p'_K$ be
the Top-K paths obtained from the APET using the approximate
retrieval technique. We define the approximation error or cost of
approximation as the average distance between the index of a
path in R’ and its occurrence in R.
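
One way to read this definition is sketched below. The sketch is an interpretation, not the paper's formula: it assumes that distance means the absolute difference between a path's index in the approximate list and its index in the exact list, that the average is taken over the K returned paths, and that paths are distinct. The function name is hypothetical.

```python
from typing import List


def approximation_error(exact: List[str], approx: List[str]) -> float:
    """Average absolute index displacement of the approximate Top-K paths."""
    if not approx:
        return 0.0
    position = {path: i for i, path in enumerate(exact)}
    total = sum(abs(i - position[path]) for i, path in enumerate(approx))
    return total / len(approx)


# Hypothetical orderings: the approximate top-2 list skips path "b".
error = approximation_error(["a", "b", "c", "d"], ["a", "c"])
```

Under these assumptions an approximate list that matches a prefix of the exact ordering has an error of zero.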


5 EMPIRICAL EVALUATION 


There is an increasing number of publicly available RDF data
sets ranging from those that are narrowly focused (e.g. DBLP 
[30], ODP [31]) to those with broader scope covering multiple 
domains (e.g. TAP [29], SWETO [28]). However, most of these 
presented limitations that made them unsuitable as evaluation
testbeds for SemRank. Because SemRank is property-centric,
it was important that the testbeds have a wide variety of
relationships. This was not the case in the narrowly focused
ontologies, where most relationships tend to be of the same
kind. In such a situation, there are few or no distinguishing
features between the relationships. This problem was also present to
some extent in the broader ontologies because they tended to be
fragmented into smaller focused ontologies with limited 
connections between them. Also most of the ontologies were 
organized as hierarchies so that most of the relationships 
represented were inheritance relationships. Another important 
issue is that evaluation testbed(s) must have data distributions 
that model reality. This is because if a testbed’s data 
distributions are skewed merely because the data collection 
process was not comprehensive then a SemRank value for a 
result may be meaningless (e.g. a property may be treated 
incorrectly as being rare even though the low frequency was due 
to the incompleteness of data). These problems are merely a 
reflection of the early developmental stages of testbeds for the 
Semantic Web. We expect that some of these ontologies will 
evolve into rich collections that may be very useful for future 
evaluations. 


Consequently, the evaluation of SemRank discussed here was 
done on synthetically generated data. The data generation was 
guided by rules to ensure that data distributions mirror the real 
world. For example, in generating data relating to Students and
Courses, it is often the case that the total number of students
merely auditing a class is less than 10% of the enrolled
students. Our sample data builds on the schemas used in the
example in Figure 2 involving the University, Banking, Flight
and Organization domains. An example query for the Semantic
Associations between the resources r1 (Sarah White) and r6
(Zachary Black) is shown in Figure 5. It shows the simple query
interface (sliding bar) that can be used to adjust search
modes easily, without users having to manipulate different criteria
values. The results of this query under different search
modes are shown in Figure 6 (purely conventional, μ = 0), Figure 7
(in between, i.e., μ = 0.5) and Figure 8 (purely discovery, μ = 1.0). A
close examination of the results provides a justification for a
modulative relevance model.


Figure 5: Query Interface (Query Specification panel with a Mode slider ranging from Conventional to Discovery, input fields for the two resources, here r1 and r6, a Relevance Specification panel, and Search, Reset, Close and Next buttons)


In addition, we can see that the different orderings map closely to
our intuition. The conventional search mode shown in Figure 6
returns as the first two results the paths of length 1, both of
which have the same number of possible valid properties in the
schema (2 in this case: adviseeOf and TAOf for the first
result, worksFor and ownsStockIn for the second).


Figure 6: Results of Search at μ = 0 (Conventional Mode). Paths from r1 (Sarah White) to r6 (Zachary Black), in the order displayed:

1. r1 adviseeOf r6
2. r1 worksFor r6
3. r1 purchased r8 forFlight r7(AA203) paidForBy r10 accountHolder r12(Mirage Corporations) electedLeader r6
4. r1 depositsInto r5 accountHolder Riverside Inc. hasStockOwner r6
5. r1 depositsInto r4 secondaryAccountHolder r6
6. r1 depositsInto r3 accountHolder r14(Benue Marks Inc) electedLeader r6
7. r1 memberOf r16(Apex Inc.) electedLeader r6
8. r1 audits r17(CS6540) taughtBy r6
9. r1 ownsStockIn r11(DynaLinks Inc.) electedLeader r6
These results (an advisee-adviser relationship between a
student and a professor, then an employee-employer
relationship) map to our intuition as the most predictable
relationships.


As mentioned earlier, SemRank orderings are independent of
path lengths. For example, the longest path, which comprises
both very common edges and very rare ones, with more of the
former than the latter, is ranked third at μ = 0. This path, which lies in
between very rare and very common, is ranked first at μ = 0.5, as
shown in Figure 7 below.


Figure 7: Results of Search at μ = 0.5. Paths from r1 (Sarah White) to r6 (Zachary Black), in the order displayed, with path numbers in parentheses:

1. (7) r1 purchased r8 forFlight r7(AA203) paidForBy r10 accountHolder r12(Mirage Corporations) electedLeader r6
2. (1) r1 adviseeOf r6
3. (4) r1 depositsInto r5 accountHolder Riverside Inc. hasStockOwner r6
4. (2) r1 worksFor r6
5. (9) r1 depositsInto r3 accountHolder r14(Benue Marks Inc) electedLeader r6
6. (8) r1 ownsStockIn r11(DynaLinks Inc.) electedLeader r6
7. (3) r1 depositsInto r4 secondaryAccountHolder r6
8. (5) r1 audits r17(CS6540) taughtBy r6
9. (6) r1 memberOf r16(Apex Inc.) electedLeader r6


Ordinarily, one might expect that the orderings of results at the
purely conventional (μ = 0) and purely discovery (μ = 1) modes
would be exact inverses of each other. However, this is not
always the case since refraction only plays a role at the
discovery end of the search spectrum.


Figure 8: Results of Search at μ = 1 (Discovery Mode). Rows recoverable from the figure (ranks 3 to 9), with path numbers in parentheses:

3. (4) r1 depositsInto r5 accountHolder Riverside Inc. hasStockOwner r6
4. (7) r1 purchased r8 forFlight r7(AA203) paidForBy r10 accountHolder r12(Mirage Corporations) electedLeader r6
5. (5) r1 audits r17(CS6540) taughtBy r6
6. (6) r1 memberOf r16(Apex Inc.) electedLeader r6
7. (3) r1 depositsInto r4 secondaryAccountHolder r6
8. (2) r1 worksFor r6
9. (1) r1 adviseeOf r6


For example, the ordering of paths 1 and 2 at the purely
discovery mode (ninth and eighth, respectively) is the exact
inverse of their ordering at the purely conventional mode (first
and second, respectively) because there is no refraction along
either of these paths. However, the orderings of paths 7 and 4
(ranked third and fourth respectively at the purely conventional
mode and fourth and third respectively at the purely discovery
mode) conform to this intuition locally but not globally, since
refraction occurs along both paths.


As mentioned earlier, users are allowed to augment their queries
with keywords. To illustrate the effect that the use of keywords
may have, Figure 9 below shows the same query at the purely
conventional mode but now with the keywords {enrolls,
depositsInto} supplied.
Figure 9: Results of Search at μ = 0 with keywords “enrolls”
and “depositsInto”. Rows recoverable from the figure, in the order displayed:

• r1 depositsInto r4 secondaryAccountHolder r6
• r1 depositsInto r3 accountHolder r14(Benue Marks Inc) electedLeader r6
• r1 purchased r8 forFlight r7(AA203) paidForBy r10 accountHolder r12(Mirage Corporations) electedLeader r6
• r1 audits r17(CS6540) taughtBy r6
• r1 ownsStockIn r11(DynaLinks Inc.) electedLeader r6


Note that the keyword depositsInto has a higher S-Match
value than enrolls since audits, a sub-property of
enrolls, appears along a path in the result instead of
enrolls itself.


With the S-Match value for these keywords, paths 4, 3, 9 and 5,
ranked fourth, fifth, sixth and eighth respectively in Figure 6, are
ranked second, fourth, fifth and seventh respectively in Figure 9.


These examples show that as long as the distribution of data 
reflects real world situations, the ordering of the results should 
be fairly close to user expectations. 


6 RELATED WORK 


[11] presents the idea of semantic search aimed at improving 
searches of documents on the Semantic Web by augmenting the 
results of traditional searches with relevant data obtained from 
multiple sources on the Semantic Web. [9] seeks to find
important nodes related to a given set of keywords using a
spread activation mechanism that is guided by information or
knowledge provided by a domain expert/knowledge engineer.
As we have discussed, discovering and ranking relationships
presents other challenges. In [26], the authors introduced the
idea of ranking resources on the Semantic Web based on a set of 
semantic links that are used to describe the semantic 
relationships amongst the resources. Our idea of using the 
similarity of keywords provided by the user to enhance the 
importance of a Semantic Association is in a way similar to this. 
[25] presents an interesting ontology-based ranking scheme for 
ranking entities on the Semantic Web. This scheme determines 
the relevance of the entities based on their specificity. In our 
scheme we have also used the notion of specificity as a measure 
of the relevance of Semantic Associations. However our 
measurement of specificity of an association is different from 
that used in this work. Furthermore, our work differs from the 
above two in two significant ways. First, we focus on ranking 
Semantic Associations that exist between entities and not the 
entities themselves. Second, we focus on providing a flexible 
approach as opposed to a fixed approach to ranking these 
associations. In this way, the associations can be ranked to meet 
the needs of the user. To the best of our knowledge, the issue of 
ranking Semantic Associations on the Semantic Web has only 
been addressed by our colleagues in [1]. This work describes
several criteria upon which the relevance of Semantic Associations
between entities is determined. Although the approach taken is
flexible, it requires the user to specify parameters for each
of the criteria, which can be an overwhelming task. Furthermore,
our work significantly advances the understanding of the nature 
of the results. 


7 CONCLUSION AND FUTURE WORK 


We have presented an approach and framework for ranking 
complex relationships resulting from a “relationship search”. 
Although an empirical evaluation was done using synthetically 
generated data due to the limitations of existing RDF data 
collections, it sufficed to show the justification for a flexible 
ranking approach that provides a variety of result orderings that 
a user may choose from, as opposed to locking users into a 
particular ranking scheme that may be unsuitable for their needs.
In the next phase of our work, we will focus on performing 
evaluations on real world data. 